Group 30: Phase 0 - Cats vs Dogs Detector (CaDoD)

Team Neuro Members:

image.png

Project Abstract

Image classification and detection are among the biggest challenges in computer vision, with applications ranging from medicine to space object detection. The aim of this project is to differentiate between cat and dog images using classification and regression models. We first preprocess the images from the image catalogue and, using their RGB intensity value arrays, feed them into classifiers to predict labels (cat/dog) and bounding boxes. As baselines we implement Logistic Regression as our classifier, optimized with stochastic gradient descent at an adaptive learning rate, along with homegrown logistic regression and linear regression to predict the class, bounding boxes, and loss function attributes, using both sklearn and PyTorch. We then plan to extend our model training by implementing a Convolutional Neural Network (CNN) for single object detection in PyTorch, and to use performance metrics such as RMSE, MSE, and accuracy to measure our models.

Project Description:

Our aim for this project is to build object detection pipelines using Python, OpenCV, SKLearn, and PyTorch to detect cats and dogs in images. We import the image catalogue data, perform exploratory data analysis on it, and derive some metrics and baseline models from the data. In order to create a detector, we first preprocess the images so they all share the same shape, take their RGB intensity values, and flatten them from a 3D array to 2D. We then feed this array into a linear classifier and a linear regressor to predict labels and bounding boxes.
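The flattening step described above can be sketched in NumPy. This is a minimal illustration with made-up names: it assumes the images were already resized to a common shape (the real pipeline would use cv2.resize for that step).

```python
import numpy as np

def flatten_images(images):
    # Stack HxWx3 RGB arrays into one batch and flatten each image to a row,
    # turning the 3D per-image arrays into a 2D (n_samples, H*W*3) matrix.
    # Assumes all images were already resized to a common shape.
    batch = np.stack(images)               # (n, H, W, 3)
    return batch.reshape(len(batch), -1)   # (n, H*W*3)

# Four fake 32x32 RGB images -> a (4, 3072) design matrix
fake = [np.zeros((32, 32, 3), dtype=np.uint8) for _ in range(4)]
X = flatten_images(fake)
```

Each row of the resulting matrix is one image, ready to be fed to a linear classifier or regressor.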

- Build an SKLearn model for image classification and another model for regression
- Implement a Homegrown Logistic Regression model; extend the loss function from CXE to CXE + MSE
- Build a baseline pipeline in PyTorch for object classification and object localization

Build a convolutional neural network for a single-object classifier and detector.

Data Description:

The data set consists of 12,966 RGB images of cats and dogs with varying shapes and aspect ratios. The image bounding box coordinates are stored in a .csv file, which contains image descriptions and box coordinate descriptions along with some required attributes. We define some of the data attributes below:

Import Data

Unarchive data

Load bounding box meta data

Exploratory Data Analysis

Statistics

Replace LabelName with human readable labels

Sample of Images

Image shapes and sizes

Go through all images and record the shape of the image in pixels and the memory size

Count all the different image shapes

There are many different image shapes. Let's narrow this down by summing the counts of every image shape that occurs fewer than 100 times and putting them in a category called "other".
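A sketch of this bucketing with pandas, using hypothetical shape counts (in the notebook the real counts come from the image scan above):

```python
import pandas as pd

# Hypothetical per-image shape strings; the real ones come from the scan above
shapes = pd.Series(["500x375"] * 150 + ["375x500"] * 120
                   + ["640x480"] * 30 + ["123x456"] * 5)

counts = shapes.value_counts()
rare = counts[counts < 100]        # shapes seen fewer than 100 times
counts = counts[counts >= 100]
counts["other"] = rare.sum()       # collapse the long tail into one bucket

# Sanity check: bucketed counts still sum to the number of images
total = counts.sum()
```

The final sanity check mirrors the "count sum matches the number of images" step below.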

Drop all image shapes

Check if the count sum matches the number of images

Plot

TODO plot aspect ratio

Preprocess

Rescale the images

Resized and Filtered Images

Plot the resized and filtered images

Comparing Images Before and After Resize

Checkpoint and Save data

Baseline in SKLearn

Load data

Double check that it loaded correctly

Classification

Split data

Create training and testing sets

Train

I'm choosing SGDClassifier because the data is large, I want to perform stochastic gradient descent, and it supports early stopping. With this many parameters a model can easily overfit, so it's important to find the point where it begins to overfit and stop there for optimal results.

Did it stop too early? Let's retrain with a few more iterations to see. Note that SGDClassifier has a parameter called validation_fraction which splits a validation set from the training data to determine when it stops.

Evaluation

Import Required Libraries

Build Processing Pipelines

Modelling

Baseline Logistic Regression

Regression

Split data

Train

Mean Absolute Percentage Error Calculation:
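A sketch of the MAPE computation; the `eps` guard against division by zero is our own addition, not necessarily part of the notebook's version:

```python
import numpy as np

def mape(y_true, y_pred, eps=1e-8):
    # Mean Absolute Percentage Error in percent; eps guards against
    # division by zero when a true coordinate is exactly 0
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / np.maximum(np.abs(y_true), eps))) * 100

# e.g. true coords [100, 200] predicted as [110, 180] -> (10% + 10%) / 2 = 10%
error = mape([100, 200], [110, 180])
```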

Experiment Log for baseline LR:

Baseline LR with Lasso and Ridge Regularization

Best parameter Alpha:

Experiment Evaluation for Ridge/Lasso:

Random Forest Regressor with LR

Evaluation

Results / Discussion

The objective of our classification model is to predict the class, i.e., whether the image contains a cat or a dog. We compared several classification models across different parameters, which resulted in a maximum accuracy of 57% for stochastic gradient descent, with random forest the next best. We observed that the majority of models have test accuracy in the range of 52-57%. We will try to get better results by simplifying the images and applying dimension and feature reduction, along with deep learning algorithms, to improve speed and accuracy.

With the regression models we predicted the bounding box coordinates and computed regression metrics for three different regressors: a baseline linear regressor, linear regression with Lasso and Ridge regularization, and a random forest regressor. Linear regression with Lasso and Ridge regularization provided the best metrics.

Challenges:

The main challenge was working with a huge dataset; when we shifted to Colab we were running out of RAM.

To overcome this issue we had to decrease the standardized image size from 128x128 to 32x32.

However, this transformation led to a loss of information and distorted images, which hurt prediction quality.

Conclusion

In Phase 1, we focused on the SKLearn baseline models: Logistic Regression and SGDClassifier to classify the images into cats and dogs, and Linear Regression to mark the bounding boxes around the cats and dogs inside the image.

We also implemented the Homegrown Logistic Regression, obtained an accuracy of about 52.6%, and calculated the CXE+MSE loss function.

We plan to implement multi-task neural networks in our next phase and try to improve model accuracy using PyTorch, CNNs, and the EfficientDet detector.

PHASE 2

Homegrown cat/dog detector pipeline in Python and Numpy

PyTorch object detector pipeline

image.png

Homegrown Linear Regression : MSE Loss

Implement a Homegrown Linear Regression model that has four target values. Extend the MSE loss function from one target to four targets (x, y, w, h).

Homegrown Logistic Regression Implementation (CXE + MSE + Regularization)

Implement a Homegrown Logistic Regression model. Extend the loss function from CXE to CXE + MSE, i.e., make it a multitask loss function so that the resulting model predicts the class and bounding box coordinates at the same time.

image.png

Homegrown Linear Regression Model:

Mean Square Error = 48.5587

Plotting Homegrown Results for Classification and Regression:

image.png

Homegrown Logistic Regression Model:

Accuracy: ~52.7%

In order to implement the homegrown version of logistic regression to classify and regress at the same time, we created a class called HomeGrownLogisticRegression that consists of important methods instrumental in training a model.

The input into the model was a 32x32x3 flattened numpy array of all the images.

Using gradient updates of the weights, we maintained one theta matrix that learned how to classify an image as cat or dog, and at the same time another theta matrix that learned to predict the Xmin, Xmax, Ymin and Ymax values from the unthresholded linear outputs, i.e., linear regression.
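The two-head update described above can be sketched in NumPy on toy data. All names and sizes here are illustrative stand-ins, not our actual training code; the point is that each theta matrix takes one gradient step against its own part of the joint CXE + MSE loss:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy stand-ins for the flattened 32x32x3 images (sizes illustrative)
n, d = 64, 3072
X = rng.normal(size=(n, d))
y_cls = rng.integers(0, 2, size=n)   # cat = 0, dog = 1
y_box = rng.normal(size=(n, 4))      # Xmin, Ymin, Xmax, Ymax stand-ins

theta_cls = np.zeros(d)       # weights for the class head
theta_box = np.zeros((d, 4))  # weights for the bounding-box head
lr = 1e-4

def joint_loss():
    p = sigmoid(X @ theta_cls)
    cxe = -np.mean(y_cls * np.log(p + 1e-12) + (1 - y_cls) * np.log(1 - p + 1e-12))
    mse = np.mean((X @ theta_box - y_box) ** 2)
    return cxe + mse

before = joint_loss()
# One gradient step per head; both gradients share the X.T @ residual form
p = sigmoid(X @ theta_cls)
theta_cls -= lr * X.T @ (p - y_cls) / n
theta_box -= lr * 2 * X.T @ (X @ theta_box - y_box) / n
after = joint_loss()
```

With a sufficiently small learning rate, one such step reduces the joint loss, which is the behavior we watched for while tuning lr.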

We observed that without a learning rate scheduler, we had to use a very small lr value to observe the CXE+MSE values reducing.

We obtained a classification accuracy of 52.7% on the validation data and an MSE still converging at the end of 1000 epochs.

PyTorch Object Detector Pipeline

Multilayer Perceptron

In this series we'll be building machine learning models (specifically, neural networks) to perform image classification using PyTorch and Torchvision.

In this first notebook, we'll start with one of the most basic neural network architectures, a multilayer perceptron (MLP), also known as a feedforward network. We adapt a workflow originally written for the MNIST handwritten-digit dataset to our own data: 32x32 RGB images of cats and dogs with two classes.

image.png

We'll process the dataset, build our model and then train our model. Afterwards we'll do a short dive into what the model has actually learned.

Data Processing

Let's start by importing all of the modules we'll need. The main ones we need to import are:

Imports

Splitting Data Into Train and Test

Normalization by Subtracting the Mean and Dividing by the Standard Deviation

Splitting Data Into Train and Validation

To ensure we get reproducible results we set the random seed for Python, Numpy and PyTorch.

Now we have defined our transforms we can then load the train and test data with the relevant transforms defined.

Next, we'll define a DataLoader for each of the training/validation/test sets. We can iterate over these and they will yield batches of images and labels which we can use to train our model.

We only need to shuffle our training set, as it will be used for stochastic gradient descent and we want each batch to be different between epochs. As we aren't using the validation or test sets to update our model parameters, they do not need to be shuffled.

Ideally, we want to use the biggest batch size that we can. The 64 here is relatively small and can be increased if our hardware can handle it.
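For illustration, a DataLoader over toy tensors shaped like our flattened images (the names and tensors here are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for our flattened 32x32x3 images and binary labels
images = torch.randn(256, 3 * 32 * 32)
labels = torch.randint(0, 2, (256,))
train_data = TensorDataset(images, labels)

# Shuffle only the training loader; validation/test loaders use shuffle=False
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

xb, yb = next(iter(train_loader))  # one batch of images and labels
```

Iterating the loader yields `(images, labels)` batches of the configured size, which is exactly what the training loop consumes.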

MLP Model For Image Classification Without Using Dropout:

We'll define our model by creating an instance of it and setting the correct input and output dimensions.

We can also create a small function to calculate the number of trainable parameters (weights and biases) in our model - in our case all of our parameters are trainable.

The first layer has 3072 neurons connected to 250 neurons, so 3072*250 weighted connections plus 250 bias terms.

The second layer has 250 neurons connected to 100 neurons, 250*100 weighted connections plus 100 bias terms.

The third layer has 100 neurons connected to 2 neurons, 100*2 weighted connections plus 2 bias terms.

$$3072 \cdot 250 + 250 + 250 \cdot 100 + 100 + 100 \cdot 2 + 2 = 793{,}552$$
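We can confirm the per-layer arithmetic programmatically with a small parameter-counting helper (the layer sizes follow the text above; the ReLU placement is illustrative):

```python
import torch.nn as nn

# The 3072-250-100-2 MLP described in the text
model = nn.Sequential(
    nn.Linear(3 * 32 * 32, 250), nn.ReLU(),
    nn.Linear(250, 100), nn.ReLU(),
    nn.Linear(100, 2),
)

def count_parameters(model):
    # Sum the element counts of every trainable tensor (weights and biases)
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params = count_parameters(model)  # 793,552 for these layer sizes
```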

Training the Model

Next, we'll define our optimizer. This is the algorithm we will use to update the parameters of our model with respect to the loss calculated on the data.

We aren't going to go into too much detail on how neural networks are trained (see this article if you want to know how) but the gist is:

We use the Adam algorithm with the default parameters to update our model. Improved results could be obtained by searching over different optimizers and learning rates, however default Adam is usually a good starting point. Check out this article if you want to learn more about the different optimization algorithms commonly used for neural networks.

Then, we define a criterion, PyTorch's name for a loss/cost/error function. This function will take in your model's predictions with the actual labels and then compute the loss/cost/error of your model with its current parameters.

CrossEntropyLoss both computes the softmax activation function on the supplied predictions as well as the actual loss via negative log likelihood.

Briefly, the softmax function is:

$$\text{softmax }(\mathbf{x}) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

This turns our output vector, where each element is an unbounded real number, into a probability distribution over the classes: all values are between 0 and 1, and together they sum to 1. (Our cat/dog model outputs two values; the worked example that follows uses ten classes for illustration.)

Why do we turn things into a probability distribution? So we can use negative log likelihood for our loss function as it expects probabilities. PyTorch calculates negative log likelihood for a single example via:

$$\text{negative log likelihood }(\mathbf{\hat{y}}, y) = -\log \big( \text{softmax}(\mathbf{\hat{y}})[y] \big)$$

$\mathbf{\hat{y}}$ is the $\mathbb{R}^{10}$ output of a 10-class network, whereas $y$ is the label, an integer representing the class. The loss is the negative log of the softmax probability at the class index. For example:

$$\mathbf{\hat{y}} = [5,1,1,1,1,1,1,1,1,1]$$

$$\text{softmax }(\mathbf{\hat{y}}) = [0.8585, 0.0157, 0.0157, 0.0157, 0.0157, 0.0157, 0.0157, 0.0157, 0.0157, 0.0157]$$

If the label was class zero, the loss would be:

$$\text{negative log likelihood }(\mathbf{\hat{y}}, 0) = - \log(0.8585) = 0.153 \dots$$

If the label was class five, the loss would be:

$$\text{negative log likelihood }(\mathbf{\hat{y}}, 5) = - \log(0.0157) = 4.154 \dots$$
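These numbers are easy to verify with a few lines of NumPy:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

y_hat = np.array([5.0] + [1.0] * 9)  # the worked 10-class example above
p = softmax(y_hat)

nll_class_0 = -np.log(p[0])  # correct, confident prediction -> small loss
nll_class_5 = -np.log(p[5])  # incorrect prediction -> large loss
```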

So, intuitively, as your model's output corresponding to the correct class index increases your loss decreases.

We then define device. This is used to place your model and data on to a GPU, if you have one.

We place our model and criterion on to the device by using the .to method.

Next, we'll define a function to calculate the accuracy of our model. This takes the index of the highest value for your prediction and compares it against the actual class label. We then divide how many our model got correct by the amount in the batch to calculate accuracy across the batch.
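A minimal version of such an accuracy function might look like this (a sketch, not necessarily the notebook's exact implementation):

```python
import torch

def accuracy(preds, labels):
    # Fraction of the batch where the argmax prediction matches the label
    top_pred = preds.argmax(dim=1)
    return (top_pred == labels).float().mean().item()

# Hypothetical logits for a batch of four 2-class predictions
logits = torch.tensor([[2.0, 0.1], [0.3, 1.5], [0.9, 0.2], [0.1, 0.8]])
labels = torch.tensor([0, 1, 1, 1])
acc = accuracy(logits, labels)  # 3 of 4 correct -> 0.75
```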

We finally define our training loop.

This will:

Some layers act differently when training and evaluating the model that contains them, hence why we must tell our model we are in "training" mode. The model we are using here does not use any of those layers, however it is good practice to get used to putting your model in training mode.

The evaluation loop is similar to the training loop. The differences are:

torch.no_grad() ensures that gradients are not calculated for whatever is inside the with block. As our model will not have to calculate gradients it will be faster and use less memory.

The final step before training is to define a small function to tell us how long an epoch took.

We're finally ready to train!

During each epoch we calculate the training loss and accuracy, followed by the validation loss and accuracy. We then check if the validation loss achieved is the best validation loss we have seen. If so, we save our model's parameters (called a state_dict).

Afterwards, we load the parameters of the model that achieved the best validation loss and then use this to evaluate our model on the test set.

Our model achieves 55.93% accuracy on the test set.

This can be improved by tweaking hyperparameters, e.g. number of layers, number of neurons per layer, optimization algorithm used, learning rate, etc.

Examining the Model

Now we've trained our model there's a few things we can look at. Most of these are simple exploratory analysis, but they can offer some insights into your model.

An important thing to do is check what examples your model gets wrong and ensure that they're reasonable mistakes.

The function below will return the model's predictions over a given dataset. It will return the inputs (image) the outputs (model predictions) and the ground truth labels.

We can then get these predictions and, by taking the index of the highest predicted probability, get the predicted labels.

Then, we can make a confusion matrix from our actual labels and our predicted labels.
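With scikit-learn this is a single call; here on hypothetical binary labels (0 = cat, 1 = dog):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual labels and model predictions
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are the actual class, columns the predicted class
cm = confusion_matrix(y_true, y_pred)
```

The diagonal holds the correct predictions; off-diagonal cells show how often one class is mistaken for the other.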

The results seem reasonable enough; the confusion matrix shows how often each class is mistaken for the other.

Next, for each of our examples, we can check if our predicted label matches our actual label.

We can then loop through all of the examples over our model's predictions and store all the examples the model got incorrect into an array.

Then, we sort these incorrect examples by how confident they were, with the most confident being first.

We can then plot the incorrectly predicted images along with how confident they were on the actual label and how confident they were at the incorrect label.

Another thing we can do is get the output and intermediate representations from the model and try to visualize them.

The function below loops through the provided dataset and gets the output from the model and the intermediate representation from the layer before that, the second hidden layer.

We run the function to get the representations.

The data we want to visualize is in ten dimensions and 100 dimensions. We want to get this down to two dimensions so we can actually plot it.

The first technique we'll use is PCA (principal component analysis). First, we'll define a function to calculate the PCA of our data and then we'll define a function to plot it.
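One way to write the 2D PCA projection directly with NumPy's SVD (a sketch under our own naming, not necessarily the notebook's implementation):

```python
import numpy as np

def pca_2d(data):
    # Center the data, then project onto the first two right singular
    # vectors, i.e. the top two principal components
    centered = data - data.mean(axis=0)
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:2].T   # (n_samples, 2)

rng = np.random.default_rng(0)
reps = rng.normal(size=(200, 100))  # stand-in for the 100-d hidden representations
points = pca_2d(reps)
```

The two returned columns are what we scatter-plot, colored by class label.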

First, we plot the representations from the ten dimensional output layer, reduced down to two dimensions.

Next, we'll plot the outputs of the second hidden layer.

The clusters seem similar to the one above. In fact if we rotated the below image anti-clockwise it wouldn't be too far off the PCA of the output representations.

An alternative to PCA is t-SNE (t-distributed stochastic neighbor embedding).

This is commonly thought of as being "better" than PCA, although it can be misinterpreted.

t-SNE is very slow, so we only compute it on a subset of the representations.

The classes look very well separated, and it is possible to use k-NN on this representation to achieve decent accuracy.

We plot the intermediate representations on the same subset.

Again, the classes look well separated, though less so than the output representations. This is because these representations are intermediate features that the neural network has extracted and will use in further layers to weigh up the evidence of which class is in the image. Hence, in theory, the classes should become more separated the closer we are to the output layer, which is exactly what we see here.

Another experiment we can do is try to generate fake images.

The function below will repeatedly generate random noise, feed it through the model, and keep the input that is most confidently predicted as the desired class.

Finally, we can plot the weights in the first layer of our model.

The hope is that there's maybe one neuron in this first layer that's learned to look for certain patterns in the input and thus has high weight values indicating this pattern. If we then plot these weights we should see these patterns.

Looking at these weights we see a few of them look like random noise but some of them do have weird patterns within them. These patterns show "ghostly" image looking shapes, but are clearly not images.

Conclusions

In this notebook we have shown:

In the next notebook we'll implement a convolutional neural network (CNN) and evaluate it on our dataset.

MLP Model For Regression

Build another PyTorch model for regression (a multilayer perceptron (MLP)) with 4 target values [y_1, y_2, y_3, y_4] corresponding to the [x, y, w, h] of the bounding box containing the object of interest.

Splitting Data Into Train And Test

Multi head detector

Build a multi-headed cat-dog detector using the OOP API in PyTorch with a combined loss function: CXE + MSE.
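A sketch of such a multi-headed model; the hidden size and trunk depth here are illustrative, not necessarily those used in our experiments:

```python
import torch
import torch.nn as nn

class MultiHeadDetector(nn.Module):
    # Shared trunk feeding a class head (2 logits) and a box head (x, y, w, h)
    def __init__(self, in_dim=3 * 32 * 32, hidden=100):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.cls_head = nn.Linear(hidden, 2)
        self.box_head = nn.Linear(hidden, 4)

    def forward(self, x):
        h = self.trunk(x)
        return self.cls_head(h), self.box_head(h)

model = MultiHeadDetector()
cxe, mse = nn.CrossEntropyLoss(), nn.MSELoss()

# Toy batch of eight flattened images with fake labels and boxes
x = torch.randn(8, 3 * 32 * 32)
y_cls = torch.randint(0, 2, (8,))
y_box = torch.rand(8, 4)

logits, boxes = model(x)
loss = cxe(logits, y_cls) + mse(boxes, y_box)  # the combined CXE + MSE loss
```

Backpropagating the summed loss trains the shared trunk on both tasks at once, which is the point of the multi-head design.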

We achieved a test accuracy of 56.42%.

Pipeline for classification and regression

Image Classification Pipeline + GridSearchCV With Dropout

MLP Regression Pipeline + GridSearchCV With Dropout

Multi Head Cat and Dog Detector with 3 Hidden Layers, Activation: ReLU

Multi Head Cat and Dog Detector with 3 Hidden Layers, Activation: Leaky ReLU

Experiment Analysis:

Phase 1: SKLEARN BASELINE MODELS image.png

Phase 2:

image-2.png

Problems/Challenges Faced

Conclusions:

PHASE 3

image.png

Transfer Learning For Object Detection And Fine-Tuning Using EfficientDet (D0-D7) For Cats and Dogs Detection

image-4.png The most important reason for the rise of EfficientDet is that with fewer parameters it gives more accuracy compared to other CNNs. If we scale the three factors (depth, width, resolution) separately, then for bigger networks the accuracy curve falls off quickly, resulting in diminishing gains. For the accuracy comparisons, see the snapshot below. image-5.png

EFFICIENTDET D7

FULLY CONVOLUTIONAL NEURAL NETWORK (FCN) FOR A SINGLE OBJECT CLASSIFIER AND DETECTOR:

FCN_pic.PNG

FCN Classification

Fully Convolutional Neural Network:

TENSOR BOARD IMAGE OUTPUTS:

train_acc_class_fcn.PNG

train_loss_class_fcn.PNG

test_acc_class_fcn.PNG

test_loss_class_fcn.PNG

Total_loss_fcn_multi.PNG

FCN Fully Convolutional Neural Network For A Single Object Classifier And Detector

TENSOR BOARD:

FCN_Class_Multi.PNG

Regg_loss_FCN_multi.PNG

Train MSE Calculation:

Test MSE Calculation:

RESULTS:

image.png

image-2.png

CONCLUSION:

Problems Faced:

image.png